Summary

This is the final report of our project. It covers our research activities in the last
reporting period and summarizes our achievements over the whole project.

Our research effort has been directed at developing an efficient system architecture
and software tools for building and running Dynamic Data Driven Application Systems
(DDDAS). The foremost requirement for efficient operation of DDDAS is flexible and
efficient management of memory and processing resources in an operating environment
characterized by continuously changing resource demands. The Fresh Breeze programming model
[1][2] is well matched to this application environment.

With generous support from AFOSR, and additional support from NSF, we have accomplished
the following steps toward an energy-efficient computing platform that directly
implements a programming model well suited to DDDAS.

•  Create  the  simulation  tool  PCASim  for  cycle-accurate  modeling  of  computing  systems 
with  novel  architectural  features. 

• Develop simulation models for the Fresh Breeze system architecture. The simulation
models are written in our own scalable meta-simulator description language and
can be translated straightforwardly into a hardware implementation with future
refinement.

•  Design  the  funJava  programming  language  which  provides  data  streams  as  first  class 
data  objects,  making  the  language  especially  suited  for  DDDAS. 

•  Develop  the  Fresh  Breeze  compiler  for  generating  Fresh  Breeze  machine  codelets  from 
source  programs  written  in  funJava,  a  functional  programming  language. 

• Implement selected test programs in funJava for demonstrating performance and energy
efficiency of the Fresh Breeze architecture.

• Develop compilation and program transformation techniques to translate data-flow
programs into executable codelets for Fresh Breeze simulators.

•  Design  an  abstraction  of  the  Mahali  system  for  controlled  collection  of  space  weather 
data  as  a  DDDAS  example,  and  express  it  in  the  Fresh  Breeze  programming  model. 

• Study the modeling of multiphase turbulent fluid flows from experimental measurement
and through numerical simulation, and characterize the use of experimental data to
adapt details of the simulation as a DDDAS.

These steps involve new approaches to issues of computer system design, programming,
and application, and are explained briefly in the following paragraphs.

1  Simulation  Using  PCASim 

The available simulation tools for the study of computer architecture are mostly limited to
exploring variations and extensions of popular commercial processors such as Intel x86 products
and MIPS. Adapting these tools for a processor with unusual features and ISA would be
challenging and impractical. The alternative is simulation software based on system
descriptions at the level of logic gates or of combinations of registers and combinational logic
blocks (RTL). Use of these tools incurs the substantial effort of expressing a design at such
a detailed level, and the substantial penalty of slow simulation speed.

We have developed PCASim, a simulation tool that avoids the drawbacks of both of
the above approaches and provides a means for cycle-accurate modeling of novel system
architectures. It is an implementation of the system simulation method proposed by Randall
Bryant [3], based on modeling the subject system as a Packet Communication Architecture
(PCA) [4].

The user of PCASim describes the subject system as an interconnection of components
in which packet communication is the only means of communication among the components.
A component in a PCA system receives packets of data from other components through its
input ports, processes the packets, and, following a specified time interval, sends packets via
its output ports to other components. The architect of the subject system must specify
a processing duration for each component that accurately reflects the delay of a feasible
hardware implementation of the component.
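
The packet-driven behavior described above can be illustrated with a minimal discrete-event
sketch. This is not PCASim code: the class names (Packet, PcaComponent, PcaSimulator), the
single output port per component, and the toy "processing" step are all assumptions made for
illustration, and the sketch does not model a component being busy while it processes.

```java
import java.util.PriorityQueue;

/** A packet in flight, ordered by the simulated time at which it arrives. */
final class Packet implements Comparable<Packet> {
    final long arrivalTime;
    final PcaComponent target;
    final int payload;

    Packet(long arrivalTime, PcaComponent target, int payload) {
        this.arrivalTime = arrivalTime;
        this.target = target;
        this.payload = payload;
    }

    @Override
    public int compareTo(Packet other) {
        return Long.compare(arrivalTime, other.arrivalTime);
    }
}

/** A component with one output port and a fixed processing duration in cycles. */
final class PcaComponent {
    final String name;
    final long processingDelay;
    final PcaComponent output;   // null marks a sink (hypothetical convention)
    long lastArrival = -1;
    int lastPayload;

    PcaComponent(String name, long processingDelay, PcaComponent output) {
        this.name = name;
        this.processingDelay = processingDelay;
        this.output = output;
    }

    /** Consumes a packet; returns the packet to send on, or null at a sink. */
    Packet receive(Packet in) {
        lastArrival = in.arrivalTime;
        lastPayload = in.payload;
        if (output == null) {
            return null;
        }
        // Toy processing: increment the payload, then forward after the delay.
        return new Packet(in.arrivalTime + processingDelay, output, in.payload + 1);
    }
}

/** Delivers packets strictly in order of simulated arrival time. */
final class PcaSimulator {
    private final PriorityQueue<Packet> pending = new PriorityQueue<>();

    void inject(Packet p) {
        pending.add(p);
    }

    /** Runs until no packets remain; returns the time of the last delivery. */
    long run() {
        long now = 0;
        while (!pending.isEmpty()) {
            Packet p = pending.poll();
            now = p.arrivalTime;
            Packet out = p.target.receive(p);
            if (out != null) {
                pending.add(out);
            }
        }
        return now;
    }
}
```

In this sketch the only timing information is the per-component processing delay, which
mirrors the requirement that the architect supply a realistic duration for each component.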

We have used PCASim to perform simulation experiments for Fresh Breeze system
configurations with as many as 128 processing cores; it is now a useful and reliable tool for
novel system modeling and evaluation. Although PCASim is designed so that creating a
multi-host version that will run on standard multicore systems is feasible, we have found
that our single-host version is fast enough for modeling and evaluating the Fresh Breeze
systems we anticipate studying.

2  Research  on  PCA  Simulation 

Based on the general ideas of packet communication architecture [3] and distributed discrete
event simulation [5], Robert Pavel has proposed and constructed PICASim, a parallel discrete
event simulation engine for the study and development of novel architectures and
program execution models at a reasonable scale prior to devoting resources toward full-scale
simulation using a tool such as PCASim. For users of PICASim, Robert has developed LADS,
a language for expressing system graphs, including but not limited to PICASim graphs, as
well as the LADSpiler, a source-to-source compiler that can optimize and translate systems
expressed in LADS into the system descriptions needed by PICASim.

He has applied this simulation engine to a version of Fresh Breeze System One in a
way that allows the flexibility to evaluate modifications to the architecture, and that provides
a framework in which different latencies can be parameterized to determine where design
changes to the architecture model may be needed to achieve better performance.

Robert has verified and demonstrated that his simulation tool can model
novel execution and architecture models with a good degree of accuracy. He successfully
completed his Ph.D. dissertation in May 2015 [6].

3  Fresh  Breeze  System  Architecture 

The Fresh Breeze system architecture supports a unique model for all data objects as trees of
fixed-size memory chunks, and provides a built-in scheduler for tasks that execute codelets.

Each data object is represented by a tree of fixed-size chunks of memory. Each chunk
has a unique identifier, its handle, that serves as a globally valid means to locate the chunk
within the storage system. Each 128-byte chunk is created and filled with data, then frozen
before being shared with concurrent tasks. This write-once policy eliminates data consistency
issues and simplifies memory management by precluding the creation of cycles in the heap
of chunks.
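
The write-once policy can be sketched in plain Java as follows. The sketch is illustrative
only: the 16-slot layout (16 eight-byte slots in a 128-byte chunk), the ChunkStore class, and
the rule that a chunk may only store the handle of an already frozen chunk are assumptions
made here to show why cycles cannot arise; they are not taken from the Fresh Breeze
specification.

```java
import java.util.HashMap;
import java.util.Map;

/** A fixed-size chunk that can be written until it is frozen. */
final class Chunk {
    static final int SLOTS = 16;   // assumed layout: 16 x 8 bytes = 128 bytes
    private final long[] slots = new long[SLOTS];
    private boolean frozen;

    void write(int slot, long value) {
        if (frozen) {
            throw new IllegalStateException("chunk is frozen");
        }
        slots[slot] = value;
    }

    long read(int slot) {
        return slots[slot];
    }

    void freeze() {
        frozen = true;
    }

    boolean isFrozen() {
        return frozen;
    }
}

/** Maps handles to chunks; handles are the only way to reach a chunk. */
final class ChunkStore {
    private final Map<Long, Chunk> chunks = new HashMap<>();
    private long nextHandle = 1;

    long allocate() {
        long handle = nextHandle++;
        chunks.put(handle, new Chunk());
        return handle;
    }

    Chunk get(long handle) {
        Chunk c = chunks.get(handle);
        if (c == null) {
            throw new IllegalArgumentException("unknown handle " + handle);
        }
        return c;
    }

    /**
     * Stores a child handle in a parent chunk. The child must already be
     * frozen and the parent must still be writable, so a reference can only
     * point from a newer chunk to an older one and no cycle can form.
     */
    void storeHandle(long parent, int slot, long child) {
        if (!get(child).isFrozen()) {
            throw new IllegalStateException("child must be frozen first");
        }
        get(parent).write(slot, child);
    }
}
```

Under these assumptions a tree is built bottom-up: leaves are filled and frozen first, and
each parent is frozen only after the handles of its children have been stored.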

Such a memory model provides a global addressing environment, a virtual one-level store,
shared by all user jobs and all cores of a many-core multi-user computing system. It may be
extended to cover the entire online storage, replacing the separate means of accessing files
and databases in conventional systems.

In a Fresh Breeze system, the unit for managing the processing resource is the single
execution of a codelet, a block of instructions to be executed without interruption once
assigned to a processing core by its task scheduler. A task executing a master codelet may
spawn one or more worker tasks that execute work codelets independently. A key innovation
is the use of a special sync chunk to collect results from the worker tasks and spawn a task to
execute a specified continuation codelet when every worker task has contributed its result [1].
Through recursive use of this scheme, a program can generate an arbitrary hierarchy of
concurrent tasks corresponding to available parallelism in the computation being performed.
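
The spawn-and-join pattern described above can be approximated in standard Java. In the
sketch below, SyncChunk is a hypothetical software stand-in for the hardware sync chunk: it
holds one result slot per worker and runs the continuation when the last worker has
contributed. The thread pool and the summation workload are assumptions chosen only to make
the example runnable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Collects one result per worker and fires a continuation on the last one. */
final class SyncChunk {
    private final long[] results;
    private final AtomicInteger pending;
    private final Consumer<long[]> continuation;

    SyncChunk(int workers, Consumer<long[]> continuation) {
        this.results = new long[workers];
        this.pending = new AtomicInteger(workers);
        this.continuation = continuation;
    }

    /** Called exactly once by each worker with its own slot number. */
    void contribute(int slot, long value) {
        results[slot] = value;
        // The atomic decrement orders this write before the continuation's reads.
        if (pending.decrementAndGet() == 0) {
            continuation.accept(results);
        }
    }
}

final class SpawnJoinSketch {
    /** Master: spawns workers over slices of the data, then joins via SyncChunk. */
    static long parallelSum(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        CompletableFuture<Long> done = new CompletableFuture<>();
        SyncChunk sync = new SyncChunk(workers, partial -> {
            long total = 0;
            for (long p : partial) {
                total += p;
            }
            done.complete(total);   // the "continuation codelet"
        });
        int span = (data.length + workers - 1) / workers;
        for (int w = 0; w < workers; w++) {
            final int slot = w;
            final int from = Math.min(data.length, w * span);
            final int to = Math.min(data.length, from + span);
            pool.execute(() -> {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                sync.contribute(slot, sum);
            });
        }
        try {
            return done.join();
        } finally {
            pool.shutdown();
        }
    }
}
```

No worker waits for another: the last contributor, whichever it happens to be, triggers the
continuation, which is the property the sync chunk provides in hardware.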

Novel features of the multi-core processing chip include: (1) cache memories are organized
around chunks instead of typical cache lines; (2) processor registers are tagged to flag
those holding handles of chunks; (3) an AutoBuffer provides direct access to chunks without
the energy and chip area costs of the tag memory used in a conventional level-one cache;
and (4) a hardware task scheduler implements fast switching among active tasks at each
processing core and a task-stealing scheme for load distribution among processing cores.

Our first simulation model (System One) has four components: a Processing Unit, a
Memory Unit, an AutoBuffer, and the architected Task Scheduler. Our second simulation
model (System Two) includes several copies of System One, with packet-switched networks
for sending memory commands and receiving memory responses between the processing and
memory units.

The instruction set of Fresh Breeze processing units is similar to a RISC load/store
design, with adaptations for accessing memory chunks and added instructions for spawning
and coordinating execution of tasks. For our first simulation experiments the complete
instruction set was not implemented. Recent work has extended support in the compiler
and simulation model to all Java data types.

4 The funJava Programming Language

The funJava programming language is a version of standard Java, restricted to a functional
subset. This simple functional programming language is extended by special classes to
support data streams as first-class data objects [7]. To the best of our knowledge, funJava
is the first and only programming language to offer this feature. Other work on supporting
data streams in programming languages includes the StreamIt language [8], which is an
extension of the C programming language using an enhanced runtime library to support basic
operations on stream data. A recent development is OpenStream [9], an enhancement of
StreamIt that offers higher-level features and better support for composing stream processing
modules. For data streams to be first-class objects, a programmer must be able to initialize
new data streams and terminate others as an application system requires. The greatest
flexibility in stream processing requires automatic memory management, including garbage
collection, which StreamIt and OpenStream do not provide. These properties make funJava
an excellent language for expressing DDDAS.

The Fresh Breeze memory model offers a natural representation for data streams. A
stream is represented by a chain of memory chunks, each containing stream elements and
the handle of the next chunk in the chain. The funJava language includes a special Java class
for data streams containing methods for the create, append, first, and rest stream operators.
These methods will be compiled into machine code instructions for operating on streams as
sequences of memory chunks.
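
The chain-of-chunks representation and the four operators can be sketched in ordinary Java.
The class below is not the funJava stream class: the names ChunkedStream and StreamChunk, the
capacity of four elements per chunk, and the use of object references in place of chunk
handles are assumptions for illustration.

```java
import java.util.NoSuchElementException;

/** One link in the chain: a few elements plus a reference to the next chunk. */
final class StreamChunk<T> {
    static final int CAPACITY = 4;   // hypothetical elements per chunk
    final Object[] items = new Object[CAPACITY];
    int count;
    StreamChunk<T> next;
}

/** A position in a chain of chunks; rest() returns a new position. */
final class ChunkedStream<T> {
    private final StreamChunk<T> chunk;
    private final int index;

    private ChunkedStream(StreamChunk<T> chunk, int index) {
        this.chunk = chunk;
        this.index = index;
    }

    /** create: an empty stream consisting of one empty chunk. */
    static <T> ChunkedStream<T> create() {
        return new ChunkedStream<>(new StreamChunk<T>(), 0);
    }

    /** append: add at the tail, linking a new chunk when the tail is full. */
    void append(T item) {
        StreamChunk<T> tail = chunk;
        while (tail.next != null) {
            tail = tail.next;
        }
        if (tail.count == StreamChunk.CAPACITY) {
            tail.next = new StreamChunk<>();
            tail = tail.next;
        }
        tail.items[tail.count++] = item;
    }

    boolean isEmpty() {
        ChunkedStream<T> s = skipExhaustedChunks();
        return s.index >= s.chunk.count;
    }

    /** first: the element at the head of the stream. */
    @SuppressWarnings("unchecked")
    T first() {
        ChunkedStream<T> s = skipExhaustedChunks();
        if (s.index >= s.chunk.count) {
            throw new NoSuchElementException("empty stream");
        }
        return (T) s.chunk.items[s.index];
    }

    /** rest: the stream without its first element; this stream is unchanged. */
    ChunkedStream<T> rest() {
        ChunkedStream<T> s = skipExhaustedChunks();
        if (s.index >= s.chunk.count) {
            throw new NoSuchElementException("empty stream");
        }
        return new ChunkedStream<>(s.chunk, s.index + 1);
    }

    /** Moves past a fully consumed chunk when a successor exists. */
    private ChunkedStream<T> skipExhaustedChunks() {
        StreamChunk<T> c = chunk;
        int i = index;
        while (i >= StreamChunk.CAPACITY && c.next != null) {
            c = c.next;
            i = 0;
        }
        return new ChunkedStream<>(c, i);
    }
}
```

Because rest() returns a new position rather than mutating the stream, several consumers can
read the same chain independently, and chunks no longer reachable from any position become
candidates for garbage collection.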

These methods are well suited to expressing processing steps such as filtering and clipping.
As we will see below, an essential operation for a typical DDDAS is the merging of streams
to produce a stream containing an interleaved sequence of all elements of the input streams.
In contrast to the other stream operations, the stream merge operation is nondeterminate:
separate runs of a program containing the merge operation may not produce the same result,
because there are many possible interleavings of two streams, any of which is a valid result.
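
A small sketch may help make the nondeterminacy of merge concrete. It uses two ordinary Java
threads and a concurrent queue as a stand-in for the funJava merge operator; the class and
method names are hypothetical. What every valid result has in common is that each input
stream appears in it as an order-preserving subsequence, which the helper isSubsequence
checks (the check assumes the elements of the two inputs are distinct).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

final class StreamMergeSketch {
    /** Merges two inputs concurrently; the interleaving depends on thread timing. */
    static <T> List<T> merge(List<T> left, List<T> right) {
        ConcurrentLinkedQueue<T> out = new ConcurrentLinkedQueue<>();
        Thread a = new Thread(() -> left.forEach(out::add));
        Thread b = new Thread(() -> right.forEach(out::add));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(out);
    }

    /** True if part occurs within whole in the same relative order. */
    static <T> boolean isSubsequence(List<T> part, List<T> whole) {
        int matched = 0;
        for (T item : whole) {
            if (matched < part.size() && part.get(matched).equals(item)) {
                matched++;
            }
        }
        return matched == part.size();
    }
}
```

Two runs of merge on the same inputs may return different lists, yet both satisfy the
subsequence property for each input, so both are valid results.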

A notable feature of parallel programs written in funJava is that any program not
containing a streamMerge operation is guaranteed to be determinate: all runs of the program
with the same input will produce the same result. There can be no data races that could
lead to anomalous behavior.

5 Research on Programming Model for Stream Computing

Applications that may be called DDDAS involve both high performance computing and
real-time interaction with elements of an environment. In addition to linear algebra operations
using vectors and matrices of numerical data, DDDAS require means for expressing and
efficiently executing additional forms of data and operations on them:

•  Streams  of  data  arriving  continuously  for  processing  by  cascades  of  operations. 

•  Symbolic  commands  sent  in  messages  to  system  components. 

•  Data  storage  and  retrieval  including  directories  for  locating  named  data  objects. 

We developed a new programming language/model to support stream processing and the
programming of DDDAS applications. Here we use a simple model DDDAS system to
illustrate how representative features of DDDAS are expected to be expressed in
the programming language. For the Fresh Breeze project the language will be a restricted
version of Java, extended with features for streams and transactions such as those used in
the examples that follow.

In our simple DDDAS system, a number of sensor components collect data items of
interest at locations in the system environment. A stream of commands from a central
control element directs each sensor. Each sensor sends, following commands, substantial
sequences of similar data items to a processor element. One would like to have the data
processed expeditiously, so the processor has an entry component (like the waiting area of a
barber shop) where jobs await available resources for the desired processing. Processed data
is held in a database and made accessible to research staff for review and evaluation.

From a high-level point of view, given the Fresh Breeze memory model in which data
objects are represented by trees of fixed-size chunks, the natural choice for streams is a
linear chain of chunks. Each chunk holds several data items and a reference to the next
chunk in the chain. Stream elements are removed from the head chunk of the chain, and
new elements are added to the tail chunk, adding a new chunk to the chain if the tail chunk
is full. Moreover, we provide the synchronization mechanism in stream processing with a
special object known as a future. The four operations on streams (new, first, rest, and
append) are supported by instructions of the Fresh Breeze instruction set.

Figure  1  illustrates  our  programming  model  for  a  kind  of  requirement  that  is  likely  to 
be  present  in  many  DDDAS.  The  computation  pattern  includes  the  selection  of  a  stream  of 
data  for  processing  from  multiple  streams  of  data  items. 
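
The lock-free insert that Figure 1 describes can be approximated in standard Java, which may
help readers who want to experiment with the idea outside a Fresh Breeze system. The sketch
below is an analogue, not the funJava code: AtomicReference.getAndSet plays the role of
guardSwap, CompletableFuture plays the role of the write-once future (complete for
futureWrite, join for futureRead), and the class name RequestQueue and its take method are
hypothetical. It assumes a single consumer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

final class RequestQueue<R> {
    /** One queue cell: a request plus a write-once link to the next cell. */
    static final class Entry<T> {
        final T request;
        final CompletableFuture<Entry<T>> next = new CompletableFuture<>();

        Entry(T request) {
            this.request = request;
        }
    }

    private final CompletableFuture<Entry<R>> head = new CompletableFuture<>();
    // Always holds the unfilled future at the end of the chain.
    private final AtomicReference<CompletableFuture<Entry<R>>> tail =
            new AtomicReference<>(head);
    // Read position; touched only by the single consumer.
    private CompletableFuture<Entry<R>> readPosition = head;

    /** Lock-free insert: swap in the new tail, then fill the old one. */
    void enterRequest(R request) {
        Entry<R> entry = new Entry<>(request);
        CompletableFuture<Entry<R>> oldTail = tail.getAndSet(entry.next); // guardSwap
        oldTail.complete(entry);                                          // futureWrite
    }

    /** Takes the next n requests in queue order, waiting for each if needed. */
    List<R> take(int n) {
        List<R> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Entry<R> entry = readPosition.join();                         // futureRead
            out.add(entry.request);
            readPosition = entry.next;
        }
        return out;
    }
}
```

Each producer obtains a distinct old tail from the atomic swap, so no two producers ever
write the same future, and the consumer simply blocks on the next unfilled link.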

6  The  Fresh  Breeze  Compiler 

The Fresh Breeze Compiler uses the standard Java compiler (javac) to produce class files
that are processed in three phases:


DISTRIBUTION  A:  Distribution  approved  for  public  release. 


class TransactionManager<RQ> {

    private Guard managerGuard;
    private Stream<QueueEntry> queueHead;
    private Stream<QueueEntry> queueTail;

    // A class for the queue of requests
    class QueueEntry {
        RQ request;
        Future<QueueEntry> next;

        QueueEntry(RQ request) {
            this.request = request;
            next = new Future<QueueEntry>();
        }
    }

    // Constructor
    TransactionManager() {
        queueTail = new Stream<QueueEntry>();
        queueHead = queueTail;
        managerGuard = new Guard(queueTail);
        processQueue(queueHead);
    }

    public void enterRequest(RQ request) {
        QueueEntry newEntry = new QueueEntry(request);
        // Atomically install the new tail and obtain the previous one
        Future<QueueEntry> oldTail = managerGuard.guardSwap(newEntry.next);
        // Fill the previous tail with the new entry
        futureWrite(oldTail, newEntry);
    }

    private void processQueue(Stream<QueueEntry> reqQueue) {
        // Waits until the next entry has been written
        QueueEntry entry = reqQueue.futureRead();
        RQ request = entry.request;
        request.processRequest();
        Future<QueueEntry> newHead = entry.next;
        processQueue(newHead);
    }

} // end class TransactionManager

Figure 1: A programming model example: the transaction manager. The two operations
guardSwap and futureWrite in the enterRequest method perform a lock-free insert of a new
request in the request queue.


• Convert: This component converts the Java bytecode of each method in a class file
into a data flow graph that represents the method. The elements of the data flow
graphs are unary and binary operators, array get and set operators, conditionals, and
while or until loops.

• Transform: The data flow graph of each method is analyzed to identify opportunities
for exploiting parallelism. In particular, loops that perform data-parallel reductions or
array construction are recognized. Then a set of data flow graphs is constructed that
represents the several codelets needed to implement the parallelism of the method.

• Construct: Each data flow graph representing a codelet is converted into a block of
machine instructions that defines a codelet ready for execution.

Figure 2: Abstraction of the Mahali vision as DDDAS.

The Fresh Breeze compiler has two unusual features. First, it uses data flow graphs as
its internal representation for programs, a choice that had been adopted earlier in compilers
for the Sisal programming language [10]. Second, because code is generated separately
for each codelet, and each codelet is relatively small, a detailed analysis of
each codelet is performed and optimized machine code is then constructed, as reported in [11].
Thus our development of the FrBzCompiler represents a significant investment in compiler
functionality that we expect could well be a model for future compilers for systems that
adopt a codelet model for parallel computation.
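
To give a feel for the kind of rewriting the Transform phase performs on a data-parallel
reduction, the sketch below shows a sequential summation loop next to a hand-written
divide-and-conquer version in standard Java. It is an analogy only: the Fresh Breeze
compiler works on data flow graphs and emits codelets, whereas this example uses the Java
fork/join framework, and the leaf size of 1,024 elements is an arbitrary assumption.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** A reduction over data[lo, hi) split recursively into independent subtasks. */
final class SumTask extends RecursiveTask<Long> {
    private static final int LEAF_SIZE = 1_024;   // hypothetical grain size
    private final int[] data;
    private final int lo;
    private final int hi;

    SumTask(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= LEAF_SIZE) {
            // Leaf work: the original loop body applied to a small slice.
            long sum = 0;
            for (int i = lo; i < hi; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = lo + (hi - lo) / 2;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                       // spawn a worker for the left half
        long rightSum = right.compute();   // do the right half in this task
        return left.join() + rightSum;     // combine, like a continuation
    }
}

final class ReductionSketch {
    /** The source-level loop a programmer would write. */
    static long sequentialSum(int[] data) {
        long sum = 0;
        for (int value : data) {
            sum += value;
        }
        return sum;
    }

    /** The same reduction expressed as a tree of tasks. */
    static long parallelSum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```

The leaf, the split, and the combine step correspond loosely to the work codelet, the master
codelet, and the continuation codelet of the Fresh Breeze task model.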

7  Mahali  Controlled  Data  Collection 

The  Mahali  Project  at  the  MIT  Haystack  Observatory  [12]  is  developing  a  plan  for  monitoring 
space  weather,  the  fluctuation  of  electron  density  in  Earth’s  ionosphere  as  a  result  of  solar 
events,  and  also  from  changes  at  the  Earth’s  surface.  Data  is  gathered  from  widely  dispersed 
sensors  and  processed  by  central  computing  facilities  for  presentation  to  and  interpretation 
by  scientists.  Data  on  electron  density  is  obtained  by  measuring  properties  of  the  navigation 
signals  transmitted  by  GPS  satellites,  which  are  affected  by  their  transmission  through  the 
ionosphere.

Figure 3: Simulation and experiment for multi-phase flow as a DDDAS.

Because  the  sensors  are  deployed  in  remote  locations  they  are  powered  by  solar  cells. 
In  consequence,  the  GPS  receivers  must  be  turned  on  only  when  interesting  data  will  be 
collected  and  sufficient  energy  is  available  to  operate  them.  To  provide  for  controlling  data 
collection  and  making  inquiries  about  the  status  of  sensors,  communication  of  commands 
from  the  data  processing  center  to  the  sensors  is  needed.  It  has  been  noted  that  the  ubiquity 
of  mobile  communication  devices  offers  the  possibility  to  utilize  these  devices  to  relay  data 
and  commands  between  sensors  and  processing  centers,  thereby  avoiding  the  expense  of 
setting  up  and  operating  a  specialized  communication  network.  With  these  characteristics, 
the  Mahali  network  and  computation  nodes  form  a  fine  example  of  DDDAS. 

We have chosen to use the controlled data collection of the Mahali vision as inspiration
for a model DDDAS to demonstrate the benefits of the Fresh Breeze programming model and
system architecture for DDDAS. Figure 2 shows the Mahali abstraction we are exploring.
Many Sensor Devices send data to and receive commands from a Processing Facility by way
of mobile Relay Stations. The activity of each component of this model other than the Merge
module can be expressed as a computation on input streams to produce an output stream.
The stream merge operation is used to combine streams of data for transmission by a relay
station and for processing by the central processing facility. The abstraction of Figure 2
incorporates all essential features of Mahali data collection and is a natural fit to the Fresh
Breeze programming model.
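
As a rough illustration of a component expressed as a computation from an input stream to an
output stream, the sketch below models a sensor that consumes a sequence of commands and
produces a sequence of output records, collecting data only while enough energy remains.
Everything in it is hypothetical: the command names, the energy accounting, and the output
strings are invented for the example and are not part of the Mahali design.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical commands sent from the processing facility to a sensor. */
enum SensorCommand { COLLECT, REPORT_STATUS, SLEEP }

final class SensorSketch {
    /**
     * Maps a command stream to an output stream. The result depends only on
     * the inputs, so the component is determinate.
     */
    static List<String> run(List<SensorCommand> commands, int energyUnits, int costPerCollect) {
        List<String> out = new ArrayList<>();
        int energy = energyUnits;
        for (SensorCommand command : commands) {
            switch (command) {
                case COLLECT:
                    if (energy >= costPerCollect) {
                        energy -= costPerCollect;   // receiver on for one sample
                        out.add("data");
                    } else {
                        out.add("skipped:low-energy");
                    }
                    break;
                case REPORT_STATUS:
                    out.add("status:energy=" + energy);
                    break;
                case SLEEP:
                    break;                          // receiver stays off
            }
        }
        return out;
    }
}
```

Outputs of many such sensors would then be combined by the merge operation at a relay
station, which is the only nondeterminate step in the abstraction.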

8 Modeling of Multiphase Flows Using a DDDAS-Based Systems Approach

Turbulent  multiphase  flows  occur  in  gas  turbines  used  for  power  generation,  aircraft  engines 
operating  in  a  polluted  environment,  as  a  result  of  explosions,  either  deliberate  or  accidental, 
and  in  other  industrial  operations.  This  work  concerns  the  efficient  and  accurate  modeling  of 
these  multiphase  flows  to  guide  the  design  of  safer  and  more  effective  equipment  for  military 
and  industrial  uses. 

The  UDel  team  (Wang  in  Computational  Multiphase  Flow  Lab  in  Mechanical  Engineering 
and  Yu  in  Computer  Graphics  Lab  in  Computer  and  Information  Sciences)  are  developing 
two  advanced  approaches  for  studying  flows  of  particle-laden  fluids.  On  one  hand,  numerical 
simulation  techniques  and  the  advent  of  increasingly  powerful  computer  systems  have  made 
fluid  simulation  at  the  mesoscopic  level  practical.  The  lattice  Boltzmann  method  is  now  a 
feasible  approach  for  these  problems.  The  other  approach  is  experimental  and  uses  a  group 
of  fast  cameras  to  collect  data  on  particle  movements  and  fluid  velocity  in  a  test  flow  field. 
This  technique  is  known  as  Plenoptic  Particle  Tracking  Velocimetry  (PPTV). 

Work in Wang’s group at UDel has implemented many improvements to the lattice
Boltzmann method (LBM), and has also developed a new mesoscopic simulation method
known as the Discrete Unified Gas Kinetic Scheme (DUGKS). Specific research activities
include:

1.  Using  the  LBM  code,  we  studied  different  profiles  conditioned  on  the  particle  surface 
for  forced  particle-laden  turbulence.  The  results  have  been  published  in  ASME  J. 
Fluids  Engineering. 

2. We investigated a number of implementation issues related to the lattice Boltzmann
simulation of moving particles in a viscous flow: (a) creation of missing populations
when a new fluid node is created; (b) the calculation of force and torque acting on the
moving particles; and (c) numerical instability. We also explored local grid refinement
to reduce force fluctuations. These efforts, along with several benchmark case-study
results, have been reported in two accepted journal papers.

3. The major task in the past year was to publish results from particle-resolved simulation
of particle-laden turbulent channel flow using the mesoscopic lattice Boltzmann
method. Two journal papers have been published along this line, reporting results
related to turbulence modulation by finite-size particles, particle concentration
distribution, and particle slip velocity on the channel wall.

4. Another major effort was to optimize the LBM particle-resolved code. A sophomore
undergraduate researcher examined each subroutine for the particulate phase, and was
able to reduce the computational overhead for the particulate phase from 50% to 20%.
In addition, he explored several ways of combining collision and streaming substeps
to speed up the flow-simulation part of the code. Overall, our simulation code is now
four times more efficient.

5. Another exciting accomplishment is the development of a second mesoscopic approach,
the Discrete Unified Gas Kinetic Scheme (DUGKS), for turbulent flows. Unlike LBM,
DUGKS is based on a finite-volume formulation of the Boltzmann equation, which
has two significant advantages: DUGKS can handle compressible turbulence and even
non-continuum flows, and DUGKS allows the use of a non-uniform grid structure. This can
greatly expand the application domains of our mesoscopic DNS tools. So far, we have
developed scalable DUGKS codes for both homogeneous flows and single-phase
turbulent channel flow.

6. We studied appropriate parallelization schemes for the LBM code. In particular,
we explored the use of GPUs (graphics processing units) to accelerate several of the most
computationally intensive kernels in the LBM code. We also experimented with generalizing
the parallel computation pattern in a new codelet-based parallel execution model,
Fresh Breeze. Furthermore, we started research on the interactive visualization of
high-performance simulation code.

7. We have recently built a closed-loop water channel which will soon be used to validate
the accuracy of a PPTV measurement system. The first step will be to compare
measurements of single-phase turbulent open channel flow with data from LBM
simulations. The second step will be to introduce solid particles into the turbulent flow
and again compare measurement data with LBM simulations.
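
For readers unfamiliar with the collision and streaming substeps mentioned in item 4, the
following is a minimal textbook-style sketch of a single-relaxation-time (BGK) lattice
Boltzmann update on a D2Q9 lattice with periodic boundaries, in which collision and
streaming are fused into one pass over the grid. It is not the group's particle-resolved
code: the grid size, relaxation time, class name, and the absence of particles, forcing, and
wall boundaries are all simplifying assumptions.

```java
/** Minimal D2Q9 BGK lattice Boltzmann sketch with periodic boundaries. */
final class LatticeBoltzmannD2Q9 {
    static final int Q = 9;
    // Discrete velocities: rest, four axis directions, four diagonals.
    static final int[] CX = {0, 1, 0, -1, 0, 1, -1, -1, 1};
    static final int[] CY = {0, 0, 1, 0, -1, 1, 1, -1, -1};
    static final double[] W = {4.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9, 1.0 / 9,
                               1.0 / 36, 1.0 / 36, 1.0 / 36, 1.0 / 36};

    final int nx;
    final int ny;
    final double tau;      // relaxation time in lattice units
    double[][][] f;        // populations f[x][y][i]

    LatticeBoltzmannD2Q9(int nx, int ny, double tau) {
        this.nx = nx;
        this.ny = ny;
        this.tau = tau;
        f = new double[nx][ny][Q];
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                for (int i = 0; i < Q; i++) {
                    f[x][y][i] = W[i];   // fluid at rest with density 1
                }
            }
        }
    }

    /** Second-order equilibrium distribution for direction i. */
    static double equilibrium(int i, double rho, double ux, double uy) {
        double cu = CX[i] * ux + CY[i] * uy;
        double uu = ux * ux + uy * uy;
        return W[i] * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * uu);
    }

    /** One time step: collide at each node and stream to the neighbor in one pass. */
    void step() {
        double[][][] next = new double[nx][ny][Q];
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                double rho = 0.0, mx = 0.0, my = 0.0;
                for (int i = 0; i < Q; i++) {
                    rho += f[x][y][i];
                    mx += CX[i] * f[x][y][i];
                    my += CY[i] * f[x][y][i];
                }
                double ux = mx / rho;
                double uy = my / rho;
                for (int i = 0; i < Q; i++) {
                    double post = f[x][y][i]
                            - (f[x][y][i] - equilibrium(i, rho, ux, uy)) / tau;
                    int tx = (x + CX[i] + nx) % nx;   // periodic wrap
                    int ty = (y + CY[i] + ny) % ny;
                    next[tx][ty][i] = post;
                }
            }
        }
        f = next;
    }

    double totalMass() {
        double mass = 0.0;
        for (int x = 0; x < nx; x++) {
            for (int y = 0; y < ny; y++) {
                for (int i = 0; i < Q; i++) {
                    mass += f[x][y][i];
                }
            }
        }
        return mass;
    }
}
```

Fusing the two substeps avoids a second sweep over the populations at the cost of a second
array; whether this matches any of the variants explored in the project is not stated in
this report.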

Both numerical simulation and PTV have limitations; however, their use together
provides a new opportunity for rigorous study of turbulent multiphase flows. Numerical
simulations may be used to calibrate experimental measurements using PTV; conversely, the
PTV data can show where simulations are inaccurate, for example, at boundary layers of
fluid flow.

Our future work will involve this joint application of experimental and numerical
modeling techniques as a dynamic data driven application system, as shown in Figure 3. An
Analyzer/Controller receives data streams from PTV cameras, monitors a concurrent
numerical simulation, and looks for discrepancies between the experimental data and computed
results. The PTV cameras are sent new commands and simulation parameters are adjusted
accordingly. Results are presented to the visualization station for operators to review and
respond.
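
The feedback loop described above can be illustrated with a deliberately simple sketch. It
is not a design for the Analyzer/Controller: the proportional update rule, the tolerance,
the gain, and the toy linear "simulation" (output equals parameter times input) are all
assumptions chosen so that the loop can be run and checked.

```java
final class AnalyzerControllerSketch {
    /**
     * Compares one measurement with the simulated value and returns the
     * adjusted parameter. Within the tolerance, the parameter is left alone.
     */
    static double adjust(double parameter, double measured, double simulated,
                         double tolerance, double gain) {
        double discrepancy = measured - simulated;
        if (Math.abs(discrepancy) <= tolerance) {
            return parameter;
        }
        return parameter + gain * discrepancy;
    }

    /**
     * Drives a toy simulation alongside a stream of measurements, adjusting
     * the single simulation parameter after each comparison.
     */
    static double calibrate(double[] inputs, double[] measurements,
                            double initial, double tolerance, double gain) {
        double parameter = initial;
        for (int i = 0; i < inputs.length; i++) {
            double simulated = parameter * inputs[i];   // stand-in for the simulation
            parameter = adjust(parameter, measurements[i], simulated, tolerance, gain);
        }
        return parameter;
    }
}
```

For example, with measurements generated by a true parameter of 2.0 and an initial guess of
1.0, repeated adjustment moves the parameter toward 2.0 and stops changing once the
discrepancy falls within the tolerance.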

Chao Yang, Ph.D. student, supported on the grant. Yang developed the interfaces between
the codelet transformer and the upstream and downstream compiler components. He
also ported the transformer between different generations of simulation, and helped test
various compiler components.

Cheng Peng, Ph.D. student, supported on the grant. Peng worked with Wang to code
and debug the lattice Boltzmann method, solved problems related to the code, and presented
results to the group and at the ICMMES 2014 meeting.

Yuan Zong, Visiting Professor from China, self-supported. She is developing and testing
a new lattice Boltzmann scheme that could be used with a rectangular grid structure.
She presented a paper at the ICMMES 2014 meeting.

Songying Chen, Visiting Professor from China, self-supported. He is developing a local
grid refinement method to improve the accuracy of the simulations. He also presented a
paper at the ICMMES 2014 meeting.

Yihua Teng, self-supported, a senior visitor from Peking University, China. He helped
develop a few benchmark cases for lattice Boltzmann simulations.